The pattern of name tokens in narrative clinical text and a comparison of five systems for redacting them

Authors

Mehmet Kayaalp

Allen C. Browne

Fiona M. Callaghan

Zeyno A. Dodd

Guy Divita

Selcuk Ozturk

Clement J. McDonald

Abstract

OBJECTIVE: To understand the factors that influence success in scrubbing personal names from narrative text.

MATERIALS AND METHODS: We developed a scrubber, the NLM Name Scrubber (NLM-NS), to redact personal names from narrative clinical reports. We hand-tagged the words in a set of gold standard narrative reports as personal names or not, then measured the scrubbing success of NLM-NS and of four other scrubbing or name recognition tools (MIST, MITdeid, LingPipe, and ANNIE/GATE) against the gold standard reports. We ran three comparisons that used increasingly large name lists.

RESULTS: The test reports contained more than 1 million words, of which 2388 were patient name tokens and 20,160 were provider name tokens. NLM-NS failed to scrub only 2 of the 2388 instances of patient name tokens. Its sensitivity was 0.999 on both patient and provider name tokens, and it missed fewer instances of patient name tokens than the other scrubbers in all comparisons. MIST produced the best all-token specificity and F-measure for name instances in our most relevant study (study 2), with values of 0.997 and 0.938, respectively. In that same comparison, NLM-NS was second best, with values of 0.986 and 0.748, respectively, and MITdeid was a close third, with values of 0.985 and 0.796, respectively. With the addition of the Clinical Center name list to their native name lists, LingPipe, MITdeid, MIST, and ANNIE/GATE all improved substantially. MITdeid and LingPipe gained the most, reaching patient name sensitivity of 0.995 (F-measure = 0.705) and 0.989 (F-measure = 0.386), respectively.

DISCUSSION: The privacy risk due to the two name tokens missed by NLM-NS was statistically negligible, since neither individual could be distinguished among more than 150,000 people listed in the US Social Security Registry.

CONCLUSIONS: The nature and size of name lists substantially influence scrubbing success. The use of very large name lists with frequency statistics accounts for much of the scrubbing success of NLM-NS.
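The abstract reports sensitivity, specificity, and F-measure without defining them. The sketch below uses the standard token-level definitions, which we assume match the ones used in the study. The class and function names are illustrative, and every count other than the 2 missed tokens out of 2388 patient name tokens is hypothetical, chosen only to exercise the formulas.

```python
from dataclasses import dataclass


@dataclass
class TokenCounts:
    """Token-level confusion counts for a name scrubber (illustrative names)."""
    tp: int  # name tokens that were redacted
    fn: int  # name tokens that were missed
    fp: int  # non-name tokens that were redacted
    tn: int  # non-name tokens that were left in place


def sensitivity(c: TokenCounts) -> float:
    """Fraction of true name tokens that were redacted (recall)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: TokenCounts) -> float:
    """Fraction of non-name tokens that were left untouched."""
    return c.tn / (c.tn + c.fp)


def precision(c: TokenCounts) -> float:
    """Fraction of redacted tokens that were true name tokens."""
    return c.tp / (c.tp + c.fp)


def f_measure(c: TokenCounts) -> float:
    """Harmonic mean of precision and sensitivity (balanced F1)."""
    p, r = precision(c), sensitivity(c)
    return 2 * p * r / (p + r)


# Figures from the abstract: 2 of 2388 patient name tokens were missed.
# fp and tn are not reported there, so they are left at 0 and not used here.
patient = TokenCounts(tp=2388 - 2, fn=2, fp=0, tn=0)
print(round(sensitivity(patient), 3))  # 0.999

# Hypothetical counts, only to show how the other measures behave.
demo = TokenCounts(tp=90, fn=10, fp=30, tn=870)
print(round(specificity(demo), 3), round(f_measure(demo), 3))
```

Because non-name tokens vastly outnumber name tokens in clinical text, specificity can stay close to 1 even when many ordinary words are redacted. This is one plausible reason why systems with similar specificity in the abstract show quite different F-measures.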


Similar resources

Syntactic Complexity of Russian Unified State Exam Texts in English: A Study on Reliability and Validity

In this study we analyze texts used in the Russian Unified State Exam (USE) in English. The texts that formed two small research corpora were retrieved from two resources: the official USE database, as a reference point, and “Neznaika” (https://neznaika.pro/), a popular website used by pupils for USE training. The two corpora are balanced in size: USE has 11934 tokens and “Neznaika” has 11918 tokens. We share Biber’...


A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...


Material Development and English for Academic Purposes Word Lists; a Reductionist Approach

Nagy (1988) states that vocabulary is a prerequisite factor in comprehension. Drawing upon a reductionist approach and having in mind the prospects for material development, this study aimed at creating an English for Academic Purposes Word List (EAPWL). The corpus of this study was compiled from a corpus containing 6479 pages of texts, 2,081,678 tokens (running words) and 63825 types (...


A Comparison between Three Methods of Language Sampling: Freeplay, Narrative Speech and Conversation

Objectives: The spontaneous language sample analysis is an important part of the language assessment protocol. Language samples give us useful information about how children use language in the natural situations of daily life. The purpose of this study was to compare Conversation, Freeplay, and narrative speech in aspects of Mean Length of Utterance (MLU), Type-token ratio (TTR), and the numbe...


MEMRI’s Narrative of Iran in the Context of Current US-Iran Tensions

Drawing on narrative theory and the notion of framing, this paper focused on the translated material from the Islamic Republic of Iran's media outlets in the website of the Middle East Media Research Institute (MEMRI) to explore how this institute constructed its desired narratives about Iran in the context of the current tensions between Iran and the U.S. In so ...



Journal title:

Volume 21, issue

Pages -

Publication date: 2014
